Dyr og Data

Getting data into R

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2025-08-31

Learning outcomes

In this class you’ll learn:

  1. how to create small data sets in R using vectors and data frames,

  2. how to import data from external files into R,

  3. how to best organise data in files.

Creating datasets

Creating data sets in R

If you have small data sets, it will often be quickest to just create the data by typing it into a script

This is especially so if the data will only be used within this script

If the data

  • are larger than 10–20 observations and 1–2 variables, or
  • will be used in more than a single script

prepare the data in a file and import them into R

A simple data set

We have two Devon Rex cats, Hansi and Apricot, who we weigh regularly

Here are the last 10 observations for Hansi:

Weight (kg) Weight (kg)
5.65 5.55
5.25 5.40
5.65 5.50
5.35 5.55
5.45 5.25

Hansi’s weight is barely enough to activate our human digital scales, so we weigh ourselves with and without holding him, so there’s a lot of noise in these weights.

Entering data into R

The simplest way to get working with these data in R is just to enter them as a vector

We use the c() function to combine values into a vector

hansi <- c(5.65, 5.25, 5.65, 5.35, 5.45, 5.55, 5.40, 5.50, 5.55, 5.25)

Tip

Remember to separate each value with a comma , and space out the values to make them easier to read

Using the data we entered

Now we can use the data like any other vector

If we wanted to know Hansi’s average (mean) weight of the most recent weighings, we could use mean()

mean(hansi)
[1] 5.46

We see that Hansi’s average weight is 5.46 kg.
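Other summary functions work on the vector in the same way. A small sketch using only base R functions; the vector is re-entered so the example runs on its own:

```r
# Hansi's last 10 weights (kg), entered as a vector
hansi <- c(5.65, 5.25, 5.65, 5.35, 5.45, 5.55, 5.40, 5.50, 5.55, 5.25)

length(hansi)  # how many weighings we entered
range(hansi)   # the lightest and heaviest weights
sd(hansi)      # how much the weights vary around the mean
```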

More than one variable

We will often be working with more than one variable

In this case we also have the observations of Apricot’s weight for her last 10 weighings

Weight (kg) Weight (kg)
3.15 3.35
3.40 3.05
3.20 3.40
3.40 3.25
3.50 3.20

More than one variable

We can enter Apricot’s data just as we did for Hansi

apricot <- c(3.15, 3.40, 3.20, 3.40, 3.50, 3.35, 3.05, 3.40, 3.25, 3.20)

and calculate her average weight

mean(apricot)
[1] 3.29

Keeping vectors together

The weights for Hansi and Apricot were recorded at the same times

That is, the first weight for Hansi was recorded at the same time as the first weight for Apricot, and so on

Instead of working with separate vectors, we can store the vectors in a data frame

cats <- data.frame(
    hansi = hansi,
    apricot = apricot
)

This is how we’ll typically encounter data throughout this course

Data frames

There is a separate video all about data frames

But simply, we can think of a data frame as R’s equivalent of an Excel worksheet

  • Different columns can store different kinds of data,
  • Each column, however, is the same length, and only contains a single type of data

cats
   hansi apricot
1   5.65    3.15
2   5.25    3.40
3   5.65    3.20
4   5.35    3.40
5   5.45    3.50
6   5.55    3.35
7   5.40    3.05
8   5.50    3.40
9   5.55    3.25
10  5.25    3.20
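A few base R functions are handy for checking what a data frame contains. This is only a sketch; the data frame is rebuilt here so the example is self-contained:

```r
# rebuild the data frame from the two weight vectors
cats <- data.frame(
    hansi   = c(5.65, 5.25, 5.65, 5.35, 5.45, 5.55, 5.40, 5.50, 5.55, 5.25),
    apricot = c(3.15, 3.40, 3.20, 3.40, 3.50, 3.35, 3.05, 3.40, 3.25, 3.20)
)

str(cats)       # the type of each column and the first few values
nrow(cats)      # number of rows (observations)
ncol(cats)      # number of columns (variables)
cats$apricot    # extract a single column as a vector
colMeans(cats)  # the mean of every column in one go
```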

Data frames

Now that we have the data in a data frame, we can use these data in models or plots

plot(hansi ~ apricot, data = cats)

Importing data

Importing data

If you

  • have more than 10 observations, or
  • have more than a couple of variables, or
  • need to reuse the data in more than a single script

you should store the data in a file and load the data into R

Filetypes

You could store the data in many different ways using a plethora of software applications

The best way to store simple tabular data is in a plain text file — .csv

You can also store your data in the newer Excel workbook format — .xlsx

Older Excel files were binary formats which could not be read by humans

I would recommend using CSV files, but Excel is also acceptable, especially for your own use

Tip

Use Excel to create a file, but save it as CSV

CSV files

CSV stands for comma separated values

In CSV files, the data are stored row-wise, with the values of the different variables separated by a comma

The first few rows of the full set of weights for Hansi and Apricot in CSV format are:

Hansi,Apricot
5.65,
5.25,3.15
5.65,3.4
5.35,3.2
5.45,3.4

Notice that the observation for Apricot is missing for the most recent weighing
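A minimal sketch of what happens to that empty field on import, assuming the readr 📦 is installed. The rows shown above are passed as literal text with I() instead of being read from a file. The empty field should come in as a missing value (NA), which functions like mean() skip only if we ask them to:

```r
library("readr")

# the CSV rows shown above, supplied as literal text rather than a file
csv_text <- "Hansi,Apricot
5.65,
5.25,3.15
5.65,3.4
5.35,3.2
5.45,3.4
"

first_rows <- read_csv(I(csv_text), col_types = "dd")

# which of Apricot's values are missing?
is.na(first_rows$Apricot)

# na.rm = TRUE tells mean() to ignore the missing value
mean(first_rows$Apricot, na.rm = TRUE)
```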

CSV files

Often, character strings will be quoted:

"Hansi","Apricot"
5.65,
5.25,3.15
5.65,3.4
5.35,3.2
5.45,3.4

Quoting allows fields (individual values) to contain the delimiter itself, for example a , inside a piece of text

CSV files

One problem with this definition of a CSV file is that some countries use , as the decimal separator

In such locales, the semi-colon ; is often used as the field separator

For example, in Denmark, Excel would create a CSV file that looked like this

"Hansi";"Apricot"
5,65;
5,25;3,15
5,65;3,4
5,35;3,2
5,45;3,4
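A minimal sketch of reading this kind of file, assuming the readr 📦 is installed. A few of the rows above are supplied as literal text with I(); read_csv2() treats ; as the delimiter and , as the decimal separator:

```r
library("readr")

# semi-colon delimited text with decimal commas, as Danish Excel would write it
csv2_text <- '"Hansi";"Apricot"
5,65;
5,25;3,15
5,65;3,4
'

danish_rows <- read_csv2(I(csv2_text), col_types = "dd")

# the decimal commas are converted to ordinary numbers on import
danish_rows
```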

Delimiters

We call the field separators (e.g. , and ;) delimiters

Other file types make use of different delimiters

  • Tab characters (\t) for Tab-delimited or TSV files
  • Space delimited

You need to be aware of how the file is delimited before you try to read it

It is a good idea to open the file in a text editor (not Word) to identify the delimiter used
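You can also peek at the first few lines of a file from within R. The sketch below assumes the readr 📦 is installed and writes a small tab-delimited file to a temporary location purely for illustration; with your own data you would use the path to your file instead:

```r
library("readr")

# create a small tab-delimited file in a temporary location (illustration only)
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("Hansi\tApricot", "5.65\t", "5.25\t3.15"), tsv_file)

# look at the raw lines to identify the delimiter; tabs are shown as \t
readLines(tsv_file, n = 3)

# once we know the file is tab-delimited, read it with read_tsv()
cats_tsv <- read_tsv(tsv_file, col_types = "dd")
cats_tsv
```

For other delimiters, readr's read_delim() takes the delimiter via its delim argument.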

Importing plain text files

Base R comes with several functions for importing data from plain text files

However, we will use the readr 📦 from the tidyverse to import and export (read & write) plain text files

The package needs to be installed as it is not a standard R package

We load the package, whenever we want to use it, using

library("readr")

Importing CSV files

CSV files are imported using read_csv() or read_csv2() — the latter is for files using ; as the delimiter and , as the decimal separator

To import a CSV file located in the current folder, we can just provide the name of the file as a character vector

cats <- read_csv("hansi-apricot-weights.csv")

If the file is located in a different folder, we need to provide the path to the file.

Importing CSV files

For example, if your working directory contains a folder named data and the data file is within it, we would use

cats <- read_csv("./data/hansi-apricot-weights.csv")

where

  • the "./" bit means the current folder,
  • the "data/" bit means go into the data folder in the current folder.
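If R cannot find a file, it helps to check where R is looking. A small sketch using base R only; the folder and file names are the ones from the example above, and the file may or may not exist on your computer:

```r
# which folder does R currently treat as the working directory?
getwd()

# build the path from its parts; file.path() inserts the / for us
csv_path <- file.path("data", "hansi-apricot-weights.csv")
csv_path

# TRUE if the file can be found relative to the working directory
file.exists(csv_path)
```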

Import the cat weight data

We import the weight data for Hansi and Apricot using

cats <- read_csv("hansi-apricot-weights.csv")
cats
Rows: 31 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): Hansi, Apricot

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 31 × 2
   Hansi Apricot
   <dbl>   <dbl>
 1  5.65   NA   
 2  5.25    3.15
 3  5.65    3.4 
 4  5.35    3.2 
 5  5.45    3.4 
 6  5.55    3.5 
 7  5.4     3.35
 8  5.5     3.05
 9  5.55    3.4 
10  5.25    3.25
# ℹ 21 more rows

Import the cat weight data

It is best practice to provide the expected variable type for each column in the data set

That way, if the data differs from your expectations, readr will complain loudly

We specify the variable types using the col_types argument

cats <- read_csv("hansi-apricot-weights.csv", col_types = "dd")
cats
# A tibble: 31 × 2
   Hansi Apricot
   <dbl>   <dbl>
 1  5.65   NA   
 2  5.25    3.15
 3  5.65    3.4 
 4  5.35    3.2 
 5  5.45    3.4 
 6  5.55    3.5 
 7  5.4     3.35
 8  5.5     3.05
 9  5.55    3.4 
10  5.25    3.25
# ℹ 21 more rows

Now that we have specified the column type, readr is much quieter, and just reads in the data
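The compact "dd" string can also be written out in full with cols(), which names each column. The sketch below assumes the readr 📦 is installed and uses a made-up file containing a deliberate mistake (the text heavy where a number belongs) to show the kind of complaint to expect. With current versions of readr I would expect a warning, with the offending value stored as NA:

```r
library("readr")

# the same specification as col_types = "dd", with each column named
cat_spec <- cols(
  Hansi   = col_double(),
  Apricot = col_double()
)

# hypothetical bad data: text where a number is expected
bad_text <- "Hansi,Apricot
5.65,heavy
5.25,3.15
"

bad_rows <- read_csv(I(bad_text), col_types = cat_spec)

# the value that could not be read as a number is stored as NA
bad_rows

# problems() lists what went wrong and where
problems(bad_rows)
```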

Importing data from Excel

If you have data in an Excel file, the readxl 📦 can be used

library("readxl")
cats <- read_xlsx("./data/my-cats/hansi-apricot-weights.xlsx")
cats
# A tibble: 31 × 2
   Hansi Apricot
   <dbl>   <dbl>
 1  5.65   NA   
 2  5.25    3.15
 3  5.65    3.4 
 4  5.35    3.2 
 5  5.45    3.4 
 6  5.55    3.5 
 7  5.4     3.35
 8  5.5     3.05
 9  5.55    3.4 
10  5.25    3.25
# ℹ 21 more rows

Importing data from Excel

We can again use col_types to tell read_xlsx() what data types to expect

But the format is different: we give the name of a type for each column, using "numeric" for columns of numbers

cats <- read_xlsx(
  "./data/my-cats/hansi-apricot-weights.xlsx",
  col_types = rep("numeric", 2)
)
cats
# A tibble: 31 × 2
   Hansi Apricot
   <dbl>   <dbl>
 1  5.65   NA   
 2  5.25    3.15
 3  5.65    3.4 
 4  5.35    3.2 
 5  5.45    3.4 
 6  5.55    3.5 
 7  5.4     3.35
 8  5.5     3.05
 9  5.55    3.4 
10  5.25    3.25
# ℹ 21 more rows

Cleaning imported data

Even nicely arranged data, like my cats’ weight data, needs some cleaning to make it easier to work with

Usually we’ll want to clean the variable names to be consistent

We use the janitor 📦 and its clean_names() function

clean_names() turns variable names to lowercase and replaces spaces with _, among other changes

cats <- cats |>
  janitor::clean_names()
cats
# A tibble: 31 × 2
   hansi apricot
   <dbl>   <dbl>
 1  5.65   NA   
 2  5.25    3.15
 3  5.65    3.4 
 4  5.35    3.2 
 5  5.45    3.4 
 6  5.55    3.5 
 7  5.4     3.35
 8  5.5     3.05
 9  5.55    3.4 
10  5.25    3.25
# ℹ 21 more rows
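The effect is easier to see with messier names than ours. A small sketch, assuming the janitor 📦 is installed; the data frame and its column names are made up for illustration:

```r
# hypothetical messy column names, as they might come out of a spreadsheet
messy <- data.frame(
  "Cat Name"         = c("Hansi", "Apricot"),
  "Body Weight (kg)" = c(5.65, 3.15),
  check.names = FALSE  # keep the names exactly as typed
)
names(messy)

# clean_names() makes the names lowercase and replaces spaces and symbols
tidy <- janitor::clean_names(messy)
names(tidy)
```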

Creating data files

Creating data files

Many data problems originate from poor choices made at the time the data were entered into an electronic format

  • transcription errors
  • unnecessary formatting
  • unnecessary complexity

Follow the KISS principle: Keep it simple, stupid

Organising your files

Most projects involve multiple stages and data files

Having a nice, clean organisation for your files is important, e.g.,

  • put your data files in a data folder
    • you can distinguish between raw-data and data, the latter containing processed data files
  • put your R scripts in a scripts or analysis folder
  • have a folder for outputs like figures for any plots you export to disk
  • use a README.md file in your working directory
    • explains what the project is,
    • how the files are structured,
    • any other relevant information
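One possible way to set up a layout like this from R, sketched in a temporary folder so nothing in your own files is touched. The project name cat-weights and the exact folder names are only examples:

```r
# an example project folder, created in a temporary location
project <- file.path(tempdir(), "cat-weights")

# create one folder per kind of file
for (folder in c("raw-data", "data", "scripts", "figures")) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# a short README explaining what the project is
writeLines(
  c("# Cat weights", "", "Regular weighings of Hansi and Apricot."),
  file.path(project, "README.md")
)

# check what we created
list.files(project)
```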

How not to organise your data

It is very easy to end up with a spreadsheet nightmare

Source: Data Carpentry

Don’t use formatting for data

Formatting cells to convey data is not easily readable by a computer

Source: Luis D. Verde Arregoitia

Don’t use formatting for data

If it’s important enough to note, make it actual data

Source: Luis D. Verde Arregoitia

Don’t merge cells

Merging cells might get you a nice table, but it’s hard to read into a computer

Source: Luis D. Verde Arregoitia

Don’t merge cells

Instead, repeat the data so each row / column has the same number of cells

Source: Luis D. Verde Arregoitia

Don’t use subheadings

Any labels go in the first row; these give the variable names

Don’t use subheadings to break data up

Source: Luis D. Verde Arregoitia

Don’t use subheadings

Instead, store the heading as data in its own column; we can use that variable later to filter on or group by

Source: Luis D. Verde Arregoitia
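A minimal sketch of why this helps, using base R and the first two weights for each of our cats. The column names cat and weight are made up for this example; the cat column holds what might otherwise have been a subheading:

```r
# the would-be subheading (which cat) is stored as an ordinary column
long_cats <- data.frame(
  cat    = c("hansi", "hansi", "apricot", "apricot"),
  weight = c(5.65, 5.25, 3.15, 3.40)
)

# filter: keep only the rows for Hansi
subset(long_cats, cat == "hansi")

# group by: the mean weight for each cat
means <- aggregate(weight ~ cat, data = long_cats, FUN = mean)
means
```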

Excel != data analysis

Don’t use Excel for analysis or processing of data

Use it as a data entry tool

Keep your layout simple

  • variable names go in row 1
  • if data are missing, leave the cell blank
  • no formatting without corresponding data
  • no subheadings
  • no spacing — one data set per sheet